Optimising `sortperm!` by shreyas-omkar · Pull Request #90 · JuliaGPU/AcceleratedKernels.jl

shreyas-omkar · 2026-06-24T08:37:17Z

`sortperm!` Throughput (Float32)

| Array Size ($n$) | Before | After | Speedup |
|---|---|---|---|
| $2^{14}$ | 0.215 ms | 0.197 ms | 1.1× |
| $2^{18}$ | 0.541 ms | 0.267 ms | 2.0× |
| $2^{20}$ | 2.185 ms | 0.490 ms | 4.5× |
| $2^{22}$ | 10.668 ms | 2.836 ms | 3.8× |
| $2^{24}$ | 53.453 ms | 11.904 ms | 4.5× |

Note: At $n = 16\text{M}$ ($2^{24}$), `sortperm!` performance is now within 1.3× of a raw, in-place `sort!` operation.

`sort!` with `by=` Transform (Float32, $n = 2^{22}$)

| Transformation Case | Before | After | Speedup |
|---|---|---|---|
| `identity` (No-op / Baseline) | n/a | 1.674 ms | n/a |
| `by=abs` | 2.197 ms | 1.891 ms | 1.16× |
| `by=x->x^2` | ~2.350 ms | 1.873 ms | 1.25× |

…er at large n)

merge_sortperm_lowmem! carries a comparator that dereferences v[ix] and v[iy] from global memory on every binary-search step inside the merge pass, making the effective traffic O(n log²n). merge_sortperm! instead copies the keys into shared memory alongside the indices so all comparisons stay in L1/shared memory.

Benchmarks on RTX 5080 (CUDA 13.2, Julia 1.12):

  n=2^18: 0.541 ms → 0.286 ms (1.9×)
  n=2^20: 2.185 ms → 0.490 ms (4.5×)
  n=2^22: 10.668 ms → 2.847 ms (3.7×)
  n=2^24: 53.453 ms → 11.900 ms (4.5×)

sortperm! is now within 1.3× of plain sort! across all tested sizes.

The public temp kwarg is preserved: it maps to temp_ix in merge_sortperm! (same semantics: a pre-allocated index swap buffer).

Tests: extend sortperm testset with full permutation-validity checks, 6 new element types (Int16/UInt16/Int64/UInt64/Float64/UInt8), edge sizes (n=1..2049), data-distribution coverage, comparator options, temp-reuse, exact Base.sortperm match, and a merge sort stability check.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
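The two strategies contrasted in this commit message can be sketched on the CPU with plain Base Julia. This is only an illustration of the idea, not the kernels in this PR: the variable names (`ix_lookup`, `keyed`, `ix_keyed`) are hypothetical, and nothing here touches shared memory or `merge_sortperm!` itself.

```julia
# CPU-only sketch; assumes nothing beyond Base Julia.
v = Float32[0.3, -1.2, 2.5, 0.3, -0.7]

# Strategy A (analogous to the lookup-based comparator): sort the indices,
# and dereference v[i] and v[j] again on every single comparison.
ix_lookup = sort(collect(1:length(v)); lt = (i, j) -> v[i] < v[j], alg = MergeSort)

# Strategy B (analogous to carrying keys alongside indices): materialise
# (key, index) tuples once, then sort the tuples by their key so each
# comparison only reads data that travels with the element being moved.
keyed = collect(zip(v, 1:length(v)))
sort!(keyed; by = first, alg = MergeSort)
ix_keyed = last.(keyed)

# Both strategies are stable merge sorts, so they agree with Base.sortperm.
@assert ix_lookup == ix_keyed == sortperm(v)
```

Strategy B uses extra memory for the key copies, which is the trade-off the name `merge_sortperm_lowmem!` suggests the original path was avoiding.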

Without hoisting, the by(elem) transform fires inside every binary-search comparison step across all O(n log²n) merge operations. With hoisting, we broadcast by.(v) once to build a key array, then delegate to merge_sort_by_key! which keeps keys in shared memory alongside values.

Benchmarks on RTX 5080 (Float32, n=2^22):

  by=abs: 2.197 ms → 1.912 ms (-13%)
  by=x->x^2: was worse → 1.920 ms
  rev=true: unchanged (no by, not hoisted)
  identity: unchanged (guarded by by !== identity check)

The temp kwarg maps to temp_values in merge_sort_by_key! preserving the existing API contract. All paths (sort!, merge_sort!, merge_sort_by_key!) now benefit automatically for any non-identity by= function.

Tests: add sort_by_transform testset with exact Base.sort output matching for Float32/Float64/Int32, edge sizes (n=1,2,513,1025), temp kwarg forwarding, type-changing by= (Float32→Bool), and identity/rev=true non-regression checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
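The hoisting idea described in this commit message can be illustrated on the CPU with Base Julia alone. This is a rough sketch under stated assumptions: `counting_by`, `calls`, `k` and `perm` are hypothetical names, and Base `sortperm` stands in for the key-value sort that the PR delegates to `merge_sort_by_key!`.

```julia
# CPU-only sketch; assumes nothing beyond Base Julia.
v = Float32[-3.0, 1.5, -0.5, 2.0]
by = abs

# Un-hoisted: the transform runs on both operands of every comparison.
calls = Ref(0)
counting_by = x -> (calls[] += 1; by(x))
sorted_unhoisted = sort(v; by = counting_by)
println("by() calls without hoisting: ", calls[], " for ", length(v), " elements")

# Hoisted: evaluate the transform once per element to build a key array,
# then order the original values by those precomputed keys.
k = by.(v)                # exactly length(v) evaluations of `by`
perm = sortperm(k)        # stand-in for a sort-by-key over (k, v)
sorted_hoisted = v[perm]

@assert sorted_hoisted == sorted_unhoisted
```

The hoisted form pays for one extra key array; whether that is a net win for a cheap transform is what the `by=abs` and `by=x->x^2` timings above measure.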

christiangnrd · 2026-06-24T10:25:35Z

KA has `supports_float64(::Backend)`

shreyas-omkar · 2026-06-24T10:28:41Z

> KA has `supports_float64(::Backend)`

Yeah, I checked it now. Thanks for the heads-up. I will change the commit.

shreyas-omkar and others added 2 commits June 24, 2026 13:33

shreyas-omkar force-pushed the sh/sort-optim branch from 2b78b0b to b5f4620 on June 24, 2026 10:43

fix: using KernelAbstractions.supports_float64

270e578

shreyas-omkar force-pushed the sh/sort-optim branch from b5f4620 to 270e578 on June 24, 2026 10:50

shreyas-omkar wants to merge 3 commits into JuliaGPU:main from shreyas-omkar:sh/sort-optim
